ABSTRACT


Each year, roughly 30% of first-year students at US bac- 
calaureate institutions do not return for their second year 
and billions of dollars are spent educating these students. 
Yet, little quantitative research has analyzed the causes and 
possible remedies for student attrition. What’s more, most 
of the previous attempts to model attrition at traditional 
campuses using machine learning have focused on small, ho- 
mogeneous groups of students. In this work, we model stu- 
dent attrition using a dataset that is composed almost exclu- 
sively of information routinely collected for record-keeping 
at a large, public US university. By examining the entirety 
of the university’s student body and not a subset thereof, 
we use one of the largest known datasets for examining at- 
trition at a public US university (N = 66,060). Our results 
show that students’ second year re-enrollment and eventual 
graduation can be accurately predicted based on a single 
year of data (AUROCs = 0.887 and 0.811, respectively). 
We find that demographic data (such as race, gender, etc.) 
and pre-admission data (such as high school academics, en- 
trance exam scores, etc.) - upon which most admissions 
processes are predicated - are not nearly as useful as early 
college performance/transcript data for these predictions. 
These results highlight the potential for data mining to im- 
pact student retention and success at traditional campuses. 


1. INTRODUCTION 


Student attrition has long been a topic of great interest in 
higher education research, with government reports on at- 
trition dating back over 100 years [31]. This interest stems 
from the fact that students who do not graduate are a lost in- 
vestment on many fronts. For higher education institutions, 
limiting attrition is central to their financial sustainability as 
they devote scarce resources towards classes and services for 
non-completing students [17]. In particular, it is estimated 
that 30% of United States (US) first-year students do not re- 
turn for their second year of post-secondary education with 
US taxpayers spending nearly $2 billion annually on educating
non-returning first-year students alone [28]. Institutions
are also concerned with attrition rates because they are cen- 
tral to estimates of institutional effectiveness, thereby af- 
fecting funding opportunities and government support [14]. 
Highlighting the impact of attrition at the institutional level 
also says nothing of its impact on students, who devote time, 
effort, and finances towards unfinished educational pursuits. 
Leaving college drastically alters career trajectories for stu- 
dents and those without college degrees face continually de- 
clining job growth and worsening job prospects [9]. 


In light of this, understanding motivations for students to 
drop out and possible remedies thereof is of great importance 
[12]. Empirical evidence to build student attrition theory 
has traditionally focused on survey-based research [30, 8]. 
However, survey instruments are often costly to implement, 
time-consuming for data collection, and produce results that 
are not always generalizable across institutions due to vastly 
different student profiles [34, 7, 8]. Institutional data that is 
routinely collected at colleges and universities (e.g. student 
application and transcript data) can provide an alternative 
data source and a way to supplement survey-based measures 
[8]. Leveraging data sources already in existence can add 
a means to more efficiently examine the student attrition 
problem and help institutions remedy the issue of attrition. 
One field that is primed to take advantage of this institu- 
tional data is educational data mining (EDM) and its focus 
on data-intensive techniques in educational settings [26, 4]. 


EDM is an emerging field with much of its research on at- 
trition centered on massive online open courses (MOOCs) 
and other online environments (e.g. [35, 13]). Studying at- 
trition in MOOCs and other online settings lends itself to 
expansive data collection opportunities and a detailed mon- 
itoring of students [23]. This limits the extent to which this 
work can be generalized to more traditional campus set- 
tings (i.e. campuses where learning is primarily on-campus, 
in-classroom). Meanwhile, EDM-centric work on predicting 
attrition at traditional campuses has been scarce and usu- 
ally limited to small, homogeneous subsets of students rather 
than the entirety of a college student population. Addition- 
ally, the focus when predicting attrition is usually on how 
well it can be predicted and less so on what type of data is 
best for these predictions. 


In this work, we predict the attrition of a large number of
undergraduate students (N = 66,060) using only their first year
of academic data. The students we examine are not from a 
single department or major within a university. Rather, they 
span the entirety of a student body, thereby comprising a 
dataset with heterogeneous aspirations, backgrounds, and 
goals. In addition, we rely almost entirely on data that is 
routinely collected at institutions of higher education. With 
this data, we seek to answer two questions: to what ex- 
tent can undergraduate student attrition be predicted using 
a limited amount of data from registrar records and what 
types of data from registrar records are most useful in pre- 
dicting attrition. The first of these has been explored in the 
past while using smaller and/or homogeneous student pop- 
ulations; the second has not been systematically examined 
in the literature to our knowledge. 


To answer the above questions, we mine the institutional 
data records at a large, public university in the US and 
engineer features for predictions. We then create numer- 
ous machine learning models using the engineered features 
and compare the performance of these models to each other. 
Then, we create separate machine learning models using 
only groups of features and not the entirety of the feature 
space to compare the predictive power of different subsets of 
institutional data. This work is an extension of our previous 
work on modeling student attrition using a limited amount 
of data [3] but where we previously focused on using the first 
term’s data in generating features for prediction, we use the 
first year’s in this work. We also extend our previous work 
to build additional machine learning models, predict attri- 
tion as defined according to two different definitions (overall 
graduation and re-enrollment after students’ first year), and 
examine the types of feature subsets most useful in predic- 
tions. In so doing, we present two key findings, both of 
which have many implications for administrative policy in 
higher education: 


• We demonstrate that the graduation and second-year
re-enrollment of students can be predicted using data that is
routinely gathered at institutions of higher education.

• We show that demographic and pre-entry features have
less predictive power than data on student academics.


2. RELATED WORK 


There are many examples of predicting attrition at tradi- 
tional campuses. Most of these focus on small, homogeneous 
subsets of students. Moseley predicted the graduation of 
528 nursing students using rule induction methods, obtain- 
ing high accuracies but not controlling for the number of 
terms/semesters examined for each student [21]. Dekker et 
al. looked at only the first semester grades of 648 students
in the Electrical Engineering department at the Eindhoven 
University of Technology and were able to predict dropout 
with 75-80% accuracy [10]. Kovačić used tree-based methods
on a similarly sized dataset of 453 students at the Open
Polytechnic of New Zealand, finding ethnicity and students’ 
course taking patterns to be highly useful in prediction [18]. 
Bayer et al. looked at 775 applied informatics students at 
the Czech Republic’s Masaryk University across three years 
[5]. Without limiting the amount of information available 
for each student, they found that including features related 
to students’ social behavior can boost prediction accuracy by 
over 10% for some models. These and similar studies,
however, focus on relatively small (e.g. N < 2,000) subgroups of
students with similar academic pursuits/foci. In addition, 
there is little consistency with respect to the timeframes 
across which data is examined for each student. Other 
approaches to predict attrition at traditional campuses in- 
clude early alert systems, which are often labor intensive and 
poorly funded [29]. These alert systems have been shown to 
positively benefit students (e.g. [16]), but usually rely on 
data gathered in the midst of a course or an academic term 
(e.g. [27, 15]), which may not always be feasible.


The work we present more closely relates to a subset of lit- 
erature looking at student attrition in the context of the 
heterogeneity of students across an entire campus and not 
just a subset thereof. Our work also deals with much larger 
student populations than those described above and, in this 
sense, it more closely resembles a more recent body of litera- 
ture. Delen used 8 years of institutional data on over 25,000 
students at a large, public US university, predicting whether 
the students would return for their second year [11]. How- 
ever, due to class imbalances, Delen re-sampled the majority 
class and ultimately used only 6,454 students for predictions. 
Ram et al. used data on about 6,500 freshmen at a large, 
public US university to predict whether students would drop 
out after their first semester, and for those that did not, 
whether they will drop out after an additional term [25]. 
Ram et al. supplemented data from institutional databases 
with student smart card transactions to infer social inte- 
gration. More recently, Nagy and Molontay predicted the 
dropout of 15,825 students from the Budapest University 
of Technology and Economics using only their information 
prior to college entry with some success [22]. 


There are a few ways in which our work contributes to this 
body of literature. Firstly, we use a much larger dataset 
than has been previously examined specifically for attrition 
(66,060 students). We examine the entirety of a large uni- 
versity’s student body and we do not limit the extent of het- 
erogeneity of the students in the dataset. Additionally, we 
also address the question of what types of features are most 
useful in predicting student attrition. In particular, previous 
works have generally used all available data sources concur- 
rently in determining which students will attrite. In this 
work, we explore what types of routinely-collected institu- 
tional data fare best when predicting attrition by comparing 
performance using different data subsets in isolation. Fi- 
nally, we concurrently compare predictions for two different 
definitions of “attrition,” highlighting the degree to which 
operationalizing the term can impact results. 


3. METHODS 

We describe the methods for this work by first detailing the 
data used in the project. We then give relevant operational 
definitions with respect to how we define attrition. There- 
after, we discuss the data subsets used in the predictions 
and the features generated. Lastly, we describe the setup of 
the machine learning experiments. 


3.1 Data Description

We collected pseudonymized, de-identified data from the
University of Washington (the University) data stewards in
2017. The University is a traditional campus setting where a
vast majority of instruction is in person and face-to-face. No
personally identifiable information was collected for the
students; instead, students were referenced using unique
identifying keys. Table 1 shows the tables that were pulled from
the registrar databases. In general, the data included in- 
formation on students’ demographics, complete transcript 
records at the University, and information from applications 
to the University. We did not have any information on stu- 
dents’ financial aid status or economic status other than that 
which was derived from their ZIP code, as described below. 
Socioeconomic factors can play a large role in the student 
attrition process [6]; however, we did not have access to
student finances for use in this work. We also did not have
access to any exit surveys from students who had either left 
the University or had graduated. 


Table 1: Data pulled from registrar databases

Table               Description

Application Data    Information from student applications to the
                    University including high school coursework

Guardian Data       Information on student guardians as pulled
                    from student applications to the University

Demographic Data    Information on student demographics including
                    date of birth, race, ethnicity, gender, etc.

Major Data          Information on majors declared by students on
                    a term-by-term (quarter-by-quarter) basis

Test Score Data     Information on student standardized test
                    results

Transcript Data     Information on student coursework and grades
                    on a term-by-term (quarter-by-quarter) basis


We restricted data to high school graduates who first en- 
rolled at the University as matriculated, baccalaureate- 
degree-seeking undergraduate students between 1998 and 
2010 without previously attending another post-secondary 
institution full-time. These students are henceforth referred 
to as “freshmen.” The dataset included students who were in 
a college in high school program but excluded those who at- 
tended junior/community college full-time after high school 
and then transferred to the University. Because the data was 
pulled in 2017, we used the year 2010 as a cutoff to allow for 
six full years of visibility on student academics at the Univer- 
sity before labelling a student as a “non-completion,” as de- 
fined in Section 3.2. In total, the dataset consisted of 66,060 
unique freshmen entrants. We then further limited the data 
for each student to information through one calendar year 
from each student’s first enrollment at the University. This 
data was limited to one calendar year for all students, re- 
gardless of the number of courses they took/passed, their 
grades, or their backgrounds. 


After joining tables of interest using the unique student iden- 
tifiers, we created features for the prediction experiments by 
either pulling them directly from the raw data or engineering
them for each student. The features were grouped in 7
groupings, which are described in Section 3.3; a comprehensive
list of features and descriptions thereof is available upon
request but was not provided in this writing in the interest 
of space. In total, there were 1,405 features and all features 
were generated for each student without exception. 


3.2 Definitions


Ambiguity with respect to operational definitions of dropout 
in literature on student attrition can make it difficult to com- 
pare results across studies [24, 33]. There are numerous ways 
in which attrition has been defined in existing literature, be 
it students dropping out from a particular course (e.g. [21]), 
re-enrolling after their first term (e.g. [1]), re-enrolling after 
their first year (e.g. [11]), graduating on time (e.g. [3]), or 
reaching some other relevant milestone (e.g. [10]). In this 
work, we defined attrition in two ways and analyzed both. We
examined attrition from students’ first year to their second 
(“re-enrollment” and “non-re-enrollment”) as well as looking 
at whether a student graduated on time (“graduate” and 
“non-completion”). We do not examine attrition on a term- 
by-term basis because of the relatively few students who 
leave the University after only a single term, as discussed in 
Section 4.1. We operationally defined non-completion and 
re-enrollment as described below. 


3.2.1 Non-Completion 

We defined “non-completion” as any freshman student who 
did not graduate with a baccalaureate degree from the Uni- 
versity within 6 calendar years of first entry to the Univer- 
sity. We defined a “graduate” as a freshman who graduated 
from the University with a baccalaureate degree within 6 cal- 
endar years of first enrollment. The University uses a quar- 
ter term system and we used the span of four consecutive 
academic quarters as a measure of one calendar year. Six 
calendar years for graduation was thus the span of 24 consec- 
utive academic quarters. This definition of non-completion 
only accounted for students’ first baccalaureate degree and 
did not take into account double-majors or double degrees. 
For example, if a student was simultaneously pursuing two 
baccalaureate degrees but only graduated with one in five 
years, they would be a graduate; alternatively, if the student
had graduated with both degrees but during their seventh
year, they would be considered a non-completion. Because
we only had access to registrar records from a single
institution and did not track students across multiple
institutions, this definition of non-completion does not take
into account students’ academic progression after leaving the
University: they could very well have transferred from the
University and graduated in good standing.


We accounted for students who took part in a college in high 
school program by converting their transferred credit total 
to a count of academic quarters completed while assuming 
typical full-time enrollment at the University. For example, 
if a student completed 30 credits in a college in high school 
program, we converted this credit total to a count of terms 
completed at the University (in this case, 2, as students typ- 
ically take 15 credits per term). We rounded the result from 
this conversion where appropriate. We then deducted this 
number when determining whether the student had gradu- 
ated within an appropriate amount of time. 
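
As an illustration, the graduation label and the credit conversion described above could be sketched as follows. This is a minimal sketch and not the implementation used in this work: the function names are hypothetical, rounding half up is an assumption (the text only says the result was rounded "where appropriate"), and deducting the converted quarters from the 24-quarter window is one reading of how the deduction was applied.

```python
CREDITS_PER_QUARTER = 15   # typical full-time load stated in the text
GRADUATION_WINDOW = 24     # six calendar years of four quarters each


def credits_to_quarters(transfer_credits):
    """Convert transferred credits to whole quarters (assumed: round half up)."""
    return int(transfer_credits / CREDITS_PER_QUARTER + 0.5)


def is_graduate(quarters_to_degree, transfer_credits=0.0):
    """Label a freshman as a graduate (True) or a non-completion (False).

    quarters_to_degree: quarters from first enrollment to the first
    baccalaureate degree, or None if no degree was earned.
    """
    if quarters_to_degree is None:
        return False
    # Assumed reading: quarters already covered by college in high school
    # credit shorten the window available at the University.
    allowed = GRADUATION_WINDOW - credits_to_quarters(transfer_credits)
    return quarters_to_degree <= allowed
```

Under these assumptions, the example from the text (30 transferred credits) converts to 2 quarters, leaving a 22-quarter window.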

3.2.2 Re-Enrollment

We defined “re-enrollment” as a student who completed at 
least one additional course within one calendar year of the 
end of their first calendar year at the University (i.e. within 
4 academic quarters from the end of their first year). “Non- 
re-enrollments” were students who were not re-enrollments. 
In this work, the definitions of graduation and re-enrollment
were treated independently of one another, in that not all
graduates were necessarily re-enrollments. It should be noted that the
University requires students who do not enroll for two con- 
secutive terms without an excused leave to be re-admitted 
at the discretion of the University. 
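
The re-enrollment definition can be illustrated with a short sketch. It assumes, hypothetically, that each student's record has been reduced to a list of 1-based quarter indices (1 = the student's first quarter at the University) in which at least one course was completed; that representation and the function name are not taken from the data described here.

```python
FIRST_YEAR_QUARTERS = 4  # four consecutive quarters count as one calendar year


def is_reenrollment(completed_course_quarters):
    """Return True if a course was completed in the year after the first year.

    completed_course_quarters: iterable of 1-based quarter indices in which
    the student completed at least one course (hypothetical representation).
    """
    window_start = FIRST_YEAR_QUARTERS + 1      # quarter 5
    window_end = 2 * FIRST_YEAR_QUARTERS        # quarter 8
    return any(window_start <= q <= window_end
               for q in completed_course_quarters)
```

A student who only completes courses again in quarter 9 or later is a non-re-enrollment under this sketch, even if they eventually graduate.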


3.3 Feature Groupings 

For every student, we engineered the subsets of features 
that are described below. For all student grades, we cal- 
culated a grade percentile and a z-score by comparing each 
student’s grades to the grades of all undergraduate students
who had taken the same course at the same time. References 
to grades include the student’s GPA (on a 4.0 scale), their 
percentile score (from 0-100), and their z-score for courses 
(representing the number of standard deviations from the 
mean, assuming a normal grade distribution). References 
to “performance” for the feature groupings include grades 
and credits earned, at the least. In some cases, references to 
performance may also include the number of graded credits 
earned (versus courses taken pass-fail) and the number of 
credits attempted. A brief description of each of the feature 
subsets is provided in Table 2. 
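
The per-course grade percentile and z-score could be computed along the following lines. This is a sketch under stated assumptions rather than the procedure used in this work: the record layout is hypothetical, the population standard deviation is assumed, and the percentile is taken as the share of grades in the same course offering that are less than or equal to the student's grade.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grade_features(records):
    """Compute (z_score, percentile) per student and course offering.

    records: list of (student_id, course_offering, grade) tuples, where
    course_offering identifies the same course taken at the same time.
    """
    by_offering = defaultdict(list)
    for _, offering, grade in records:
        by_offering[offering].append(grade)

    features = {}
    for student, offering, grade in records:
        grades = by_offering[offering]
        mu, sigma = mean(grades), pstdev(grades)
        # If every grade in the offering is identical, define z as 0.
        z = (grade - mu) / sigma if sigma > 0 else 0.0
        pct = 100.0 * sum(g <= grade for g in grades) / len(grades)
        features[(student, offering)] = (z, pct)
    return features
```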


Table 2: Data subsets used in predictions

Subset                    Description

Base Data                 Year and quarter of University entry
                          (included with every other data subset)

Demographic Data          Non-academic data prior to entry to the
                          University, including demographics

Department-level Data     Measures of performance aggregated by
                          course department

First-Year Summary Data   Aggregated measures of academic
                          performance during first year

Grouped Course Data       Measures of performance aggregated by
                          course number and STEM gatekeepers

Major Data                Counts of majors declared on a
                          term-by-term basis

Pre-Entry Data            Academic data prior to entry to the
                          University


3.3.1 Base Data 


Base data consisted of only three features and was included 
in the feature space when making predictions using every 
other data subset described. The base data included stu- 
dents’ calendar year of entry to the University, their quarter 
of entry to the University (i.e. which of the four academic
quarters was a student’s first; ranging from 1 to 4, with 1, 2, 
3, and 4 corresponding to winter, spring, summer, and au- 
tumn academic quarters, respectively), and a quarter-year 
variable which consisted of students’ year of entry multiplied 
by 4 and added to the quarter of entry to create a relative
time scale. These features were included to account for any
time-related variation in graduation rates. 
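
The quarter-year variable follows directly from the description above. The sketch below uses a hypothetical function name and simply restates the arithmetic (year of entry multiplied by 4, plus the quarter of entry coded 1 to 4).

```python
def quarter_year(year_of_entry, quarter_of_entry):
    """Relative time scale: year * 4 + quarter (1=winter ... 4=autumn)."""
    if quarter_of_entry not in (1, 2, 3, 4):
        raise ValueError("quarter_of_entry must be 1, 2, 3, or 4")
    return year_of_entry * 4 + quarter_of_entry
```

With this coding, consecutive quarters differ by exactly one even across a change of calendar year: autumn 1998 maps to 7996 and winter 1999 maps to 7997.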


3.3.2 Demographic Data 

Demographic data consisted of students’ non-academic
information prior to entry to the University. This included,
but was not limited to, students’ gender, race, ethnicity, age 
at college enrollment, veteran status, and student athlete 
status. We also included information from students’ appli- 
cation to the University, such as information on the stu- 
dents’ high schools (excluding high school grades), parents’ 
educational attainment, and students’ ZIP (postal) code, 
which was either pulled from their high school information 
or, when unavailable, from their university application. We 
joined students’ ZIP codes with 2015 US census data to
find the average income and educational attainment in each 
ZIP code. We also included the distance from the Univer- 
sity to each student’s home ZIP code. Features derived from 
ZIP codes were the only features from sources external to 
the University’s registrar databases. 


3.3.3 Department-level Data 

Department-level data consisted of student performance in 
course offerings grouped by course prefix. For example, this 
included performance in all BIOL (biology) courses grouped 
together, performance in all HIST (history) courses grouped 
together, etc. We excluded course prefixes for which fewer
than 10 students in the dataset took a course. In all,
this included 200 unique course prefixes and 1000 features, 
with GPA, percentile grade, z-score, credits earned, and 
graded credits earned calculated for each prefix. We used 
department-level data instead of individual course data af- 
ter preliminary modeling using individual courses did not 
yield strong results. The expansive feature space when en- 
gineering features across individual courses also significantly 
increased the requisite computational power/time for mod- 
eling and we decided against pursuing this further. 
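
To make the grouping by course prefix concrete, the sketch below aggregates one of the five measures (credits earned) per student and prefix and drops sparsely taken prefixes. The record layout, the function name, and the reading of the exclusion rule as "fewer than 10 distinct students" are assumptions made for illustration only.

```python
from collections import defaultdict


def credits_by_prefix(records, min_students=10):
    """Sum credits earned per (student, course prefix).

    records: list of (student_id, course_prefix, credits_earned) tuples.
    Prefixes taken by fewer than min_students distinct students are dropped.
    """
    students_per_prefix = defaultdict(set)
    for student, prefix, _ in records:
        students_per_prefix[prefix].add(student)
    kept = {p for p, s in students_per_prefix.items()
            if len(s) >= min_students}

    totals = defaultdict(float)
    for student, prefix, credits in records:
        if prefix in kept:
            totals[(student, prefix)] += credits
    return dict(totals)
```

Repeating the same aggregation for GPA, percentile grade, z-score, and graded credits would give five features per retained prefix, which matches the 200 prefixes and 1000 features reported above.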


3.3.4 First-Year Summary Data 

First-year summary data consisted of aggregate measures of 
students’ first year at the University. This included, among 
other things, students’ course performance, credits taken, 
number of courses failed, number of quarters enrolled, and 
enrollment in freshman seminar courses. The first-year
summary data also included aggregate measures of students’ 
performance in their first, second, third, and fourth quarters 
as well as student performance in the last academic quarter 
for which they were enrolled during their first year (regard- 
less of which quarter it was). We also included differences 
between students’ performance in successive quarters. 


3.3.5 Grouped Course Data


Grouped course data consisted of student course perfor- 
mance grouped either by course number or by performance 
in “STEM gatekeepers.” To group courses by course num- 
ber, we aggregated performance across all courses that were 
numbered below 100, from 100-199, from 200-299, from 300- 
399, and 400+. The course numbering generally reflected 
whether the course was designed to be taken by lowerclass- 
men or upperclassmen and, in some cases, also indicated 
during which year students typically took the course. STEM
gatekeepers refer to introductory science, technology, engi- 
neering, and math (STEM) courses which often function as 
pre-requisites for STEM majors and degrees. These gate- 
keeper courses tend to be highly competitive and perfor- 
mance in these courses is a key determinant of whether a 
student will be accepted into any of the highly competitive 
STEM majors. We grouped the performance in STEM gate- 
keepers by course department and topic (e.g. the calculus 
series, the general chemistry series, the organic chemistry 
series, etc.) as well as across all STEM gatekeepers. 
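
The course-number bands described above can be expressed as a small lookup. The band labels and the function name below are hypothetical; only the boundaries (below 100, 100-199, 200-299, 300-399, and 400+) come from the text.

```python
def course_level_band(course_number):
    """Map a course number to one of the five bands used for grouping."""
    if course_number < 100:
        return "below_100"
    if course_number >= 400:
        return "400_plus"
    lower = (course_number // 100) * 100
    return f"{lower}_{lower + 99}"
```

Performance measures would then be aggregated within each band in the same way as for any other grouping.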


3.3.6 Major Data 

Major data consisted of counts of students’ major declara- 
tions during their first academic year. In most cases, stu- 
dents entered the University with a “pre-major” designation 
before declaring their major(s) of interest some time dur- 
ing their first or second year. These pre-major designations 
varied based on field of interest (e.g. pre-engineering, pre- 
nursing, pre-health, etc.). Students’ majors were recorded 
on a per-quarter basis by the University (once per quarterly 
transcript record) and we tallied the counts of major decla- 
rations for each student across the entirety of their first year. 
For example, a student who declared a math major in their 
first two quarters only to switch to geography in their third 
quarter and then add a history double major in their fourth 
quarter would have the values 2, 2, 1 in the math major, 
geography major, and history major features, respectively. 
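
The tallying in this example can be reproduced with a short sketch. It assumes, hypothetically, that each student's first year is available as a list with one entry per quarter, each entry being the list of majors recorded on that quarter's transcript.

```python
from collections import Counter


def major_counts(majors_by_quarter):
    """Count how many first-year quarters each major was declared in.

    majors_by_quarter: list of lists, one inner list of major names per
    quarterly transcript record (hypothetical representation).
    """
    counts = Counter()
    for majors in majors_by_quarter:
        counts.update(majors)
    return counts


# The student from the example in the text.
example = [["math"], ["math"], ["geography"], ["geography", "history"]]
print(major_counts(example))
```

For the example student this yields 2 for math, 2 for geography, and 1 for history, as stated above.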


3.3.7 Pre-Entry Data

Pre-entry data consisted of students’ academic information 
prior to attending the University. This included, among 
other things, students’ entrance exam scores, high school 
GPA, high school coursework, and college in high school 
program participation and performance. We did not include 
any information on students after their enrollment at the 
University in the pre-entry data. 


3.4 Machine Learning and Predictions 

We randomly divided the students into training and test 
sets using an 80-20 split (N in training = 52,848; N in test =
13,212). We used the same test set when evaluating the pre- 
dictive performance of each of the models to allow for direct 
comparisons to be made. The data was highly skewed with 
graduates and re-enrollments comprising 78.5% and 93.1% 
of all the data, respectively. Graduates and re-enrollments 
comprised 78.0% and 92.9% of the test data, respectively. 
Though dealing with class imbalances is of great interest 
when examining freshmen attrition [32], we did not use any 
balancing techniques as we wanted to work with the data in 
its original, unaltered form. We scaled the training data by 
subtracting the median of each feature and dividing by the 
respective feature’s interquartile range. We subsequently 
scaled the test data using the scaling values for each feature 
from the training data. 
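
The scaling step can be sketched for a single feature as follows. This is an illustrative sketch only: the quantile interpolation method and the fallback to a scale of 1 when the interquartile range is zero are assumptions, and the function names are hypothetical. The key point it shows is that the median and interquartile range come from the training data and are reused unchanged for the test data.

```python
from statistics import median, quantiles


def fit_robust_scaler(train_values):
    """Return (median, IQR) of one feature, computed on training data only."""
    q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed fallback: avoid division by zero for constant-like features.
    return median(train_values), (iqr if iqr > 0 else 1.0)


def apply_robust_scaler(values, center, scale):
    """Scale values with statistics that were fitted on the training data."""
    return [(v - center) / scale for v in values]


center, scale = fit_robust_scaler([1, 2, 3, 4, 5])
test_scaled = apply_robust_scaler([3, 7], center, scale)
```

Fitting on the training set alone keeps information about the test set from leaking into the features used for training.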


We used five different machine learning models to predict 
each student’s graduation and re-enrollment: regularized lo- 
gistic regression (LR), K-Nearest Neighbors (KNN), random 
forests (RF), support vector machines (SVM), and gradi- 
ent boosted trees (XGB). We trained each model across the 
entirety of the training data and used the same training
instances to train each of the models. We trained each
model separately to predict graduation and re-enrollment. 
We tuned model hyperparameters for each model using 5- 
fold cross validation on the training data, after which the 
models were re-trained on the entirety of the training data 
using the tuned hyperparameters. We report final error
metrics and performance on the test set, which was the same
for all models, regardless of whether they predicted
graduation or re-enrollment.
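
Because the classes are highly imbalanced, accuracy is best read against the majority-class baseline, whereas AUROC is unaffected by always predicting the majority class. The sketch below illustrates both quantities; it is not the evaluation code used in this work, and the function names are hypothetical. AUROC is computed from its pairwise interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, a classifier that labels every test student a re-enrollment would reach 92.9% accuracy on the test set described above, but a constant score gives an AUROC of only 0.5.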


After developing predictive models using all features, we cre- 
ated regularized logistic regression models using each of the 
6 feature subsets highlighted in Section 3.3 in isolation. The 
base data (see Table 2) was included in the feature space for 
each data subset. The rationale behind using regularized lo- 
gistic regression for these models is further discussed in Sec- 
tion 4.3. We understand that an alternative approach would 
be to test all the models listed above for each of the data sub- 
sets to find the best performing model/subset combinations. 
That said, we believe our approach was still suitable for com- 
paring different data subsets. When modeling using data 
subsets, we used the same observations as before to train 
each of the models and, as before, we developed a separate 
model for predicting graduation and re-enrollment for each 
of the data subsets. As such, the training instances were 
the same across models but the training features differed 
depending on the feature subset used. We tuned the reg- 
ularization strength for these regularized logistic regression 
models using 5-fold cross validation on the training dataset 
and we report results on the test set. 
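
The construction of the per-subset feature spaces can be sketched as below. All feature names are hypothetical placeholders, since the full feature list is not reproduced here; the sketch only shows that the three base features are prepended to each of the six subsets before a separate model is fitted on each.

```python
# Hypothetical placeholder names; the real features are not listed in the text.
BASE = ["entry_year", "entry_quarter", "quarter_year"]
SUBSETS = {
    "demographic": ["gender", "age_at_entry"],
    "department_level": ["BIOL_gpa", "HIST_gpa"],
    "first_year_summary": ["first_year_gpa", "courses_failed"],
    "grouped_course": ["gpa_100_199", "stem_gatekeeper_gpa"],
    "major": ["math_major_count"],
    "pre_entry": ["high_school_gpa", "entrance_exam_score"],
}


def subset_feature_spaces(base, subsets):
    """Return one column list per subset, each starting with the base features."""
    return {name: list(base) + [c for c in cols if c not in base]
            for name, cols in subsets.items()}


feature_spaces = subset_feature_spaces(BASE, SUBSETS)
```

Each column list would then select the training features for one regularized logistic regression model, with the training instances held fixed across subsets.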


4. RESULTS AND DISCUSSION 
4.1 Student Characteristics 


We show the number and proportion of graduates and re- 
enrollments in Figure 1. In all, 78.5% of students were la- 
belled graduates while 93.1% of students were labelled re- 
enrollments. These proportions were verified with the Uni- 
versity’s office of institutional analysis. Such highly skewed 
data towards graduates and re-enrollments can be expected 
in a large, tier-1 research university setting where there has 
been considerable, long-standing effort to improve the over- 
all attrition rate over time. That said, it must also be noted 
that at an institution with such a large student population, 
even small fractions of the student body represent hundreds 
of students on an annual basis. Across the timeline of the 
dataset (13 cohorts), 14,196 non-completions and 4,563
non-re-enrollments represent 1,092 and 351 students on an
annual basis, respectively.


We show the cumulative percentage of students who either 
graduated or left the University across time in Figure 2. We 
used the first year as a cutoff for the data because, histor- 
ically, a large number of students decide whether they will 
continue with their higher education pursuits during and 
immediately after their first year [28]. As such, developing 
models that can predict whether students will re-enroll for 
a second year and whether they are on a trajectory towards 
successful graduation could help administrators and aca- 
demic advisors more effectively develop and deliver interven- 
tions directed towards students in need of assistance. When 
examining the data, 27.5% of all non-completions leave the 
university prior to the start of their 2nd year, 51.9% of non- 
completions leave the University between their 2nd and 6th 
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Graduates Non-Completions 
N = 51,864 N = 14,196 
78.51% 21.49% 


Re-Enrollments Non-Re-Enrollments 
N= 61,497 N = 4,563 
93.09% 6.91% 


Figure 1: Counts and percentages of classes in the 
dataset. Definitions are provided in Section 3.2. 


year, and 20.6% continued to be enrolled at the University 
after their 6th year. The difference in number between non- 
completions who did not return for their 2nd year and non- 
re-enrollments can be attributed to non-re-enrollments who 
later returned to the University and graduated on time. Fewer 
than 5% of non-completions and fewer than 15% of 
non-re-enrollments left the University after only one term, so 
we did not examine attrition after the first and second terms. 
In settings where attrition rates are higher after students’ 
first and second terms, it may be more relevant to examine 
the performance of classifiers after one or two terms. 
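
The size of the gap between the two groups can be approximated from the figures reported above. The following is a rough sketch rather than part of the original analysis: it applies the rounded 27.5% share to the non-completion count in Figure 1, so the result is only as precise as that percentage. All variable names are illustrative.

```python
# Counts reported in Figure 1.
non_completions = 14_196
non_re_enrollments = 4_563

# Share of non-completions who left before their 2nd year (rounded in the text).
share_left_before_year_2 = 0.275

# Approximate number of non-completions who did not return for a 2nd year.
left_before_year_2 = round(non_completions * share_left_before_year_2)

# Non-re-enrollments who are not counted among those non-completions,
# i.e. students who missed their 2nd year but later returned and graduated.
returned_later = non_re_enrollments - left_before_year_2

print(left_before_year_2, returned_later)
```

Under these assumptions, a few hundred non-re-enrollments account for the difference between the two counts.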


Figure 2 also shows that a majority of graduates (65.6%) 
completed their degrees during their fourth year at the 
University. This is particularly apparent in the 
near-sigmoidal shape of the cumulative graph for graduates, 
with a sharp rise during students’ fourth year. The mean and 
median completion times for all graduates were 16.6 and 15.0 
calendar quarters, respectively, from first enrollment. We also see 
that there is a relative lack of students who graduated prior 
to the start of their third year. This highlights the difficulty 
in predicting graduation based on students’ first year - a 
student typically does not graduate until several years later, 
during which a host of influences can shape an academic 
trajectory, be they personal, financial, or academic. 


4.2 Predictions Using Different Algorithms 


Table 3: Prediction results using all data features. 
Baseline values are based on test set. 


            Graduation           Re-Enrollment 
Model       Accuracy   AUROC     Accuracy   AUROC 
Baseline    78.0%      0.500     92.9%      0.500 
LR          83.2%      0.811     95.0%      0.882 
RF          83.1%      0.806     95.3%      0.887 
XGB         83.0%      0.806     95.1%      0.885 
KNN         82.5%      0.798     94.8%      0.876 
SVM         78.0%      0.780     92.9%      0.862 


We show the performance of each of the models using the 
entirety of the feature space in Table 3. The baseline 
in the table is the share of the majority class in the 
test set. Most of the models performed similarly on each 
prediction task (i.e. predicting either graduation or 
re-enrollment). This hints at 


[Figure 2 plot omitted: cumulative percentage curves for 
Graduates and Non-Completions; horizontal axis: Year in 
University (1st through 6th), with Quarter along the top axis.] 

Figure 2: Cumulative graduation and non-completion 
curves of students. Years and quarters 
are relative to the time of first enrollment. The dotted 
line indicates the point to which data is limited 
for each student. Only students’ first six years are 
shown, per the definition of “graduate.” 


an effective ceiling with respect to predictive power from the 
types of features being used (i.e. ones pulled from registrar 
records) and that additional representations of the student 
experience (be they academic or social) should be incorpo- 
rated. Alternatively, a more complex predictive model (e.g. 
deep neural networks) may also fare better in making these 
predictions. That said, given the data used, the models are 
able to predict the eventual graduation and re-enrollment 
of students fairly successfully, as evidenced by the relative 
improvements over baseline values for both prediction tasks. 


For predicting graduation, logistic regression was the best- 
performing model, followed by random forests. When pre- 
dicting re-enrollment, random forests performed the best, 
followed by gradient boosted trees and logistic regression. 
These results are generally in line with our previous work on 
similar tasks, where we found that logistic regression tends 
to work well compared to other models for predicting grad- 
uation and STEM attrition [2]. When examining the worst- 
performing models, the SVM model made predictions that 
consisted entirely of the majority class when predicting both 
graduation and re-enrollment, as seen by the models’ accu- 
racy being the same as the baseline values. Such results are 
typical of classifiers without much predictive strength on a 
dataset consisting of highly disproportionate classes. In this 
specific case, it may be remedied by using alternate kernels 
for the model, which we did not explore in this work. 
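
To make the baseline comparison concrete, the sketch below shows why a model that always predicts the majority class scores exactly the baseline accuracy. It is an illustration only: the toy labels and the helper names `majority_baseline_accuracy` and `accuracy` are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy test set (hypothetical): 1 = graduated, 0 = did not graduate.
y_true = [1] * 78 + [0] * 22

# A degenerate model that predicts "graduated" for every student.
y_pred_majority = [1] * len(y_true)

baseline = majority_baseline_accuracy(y_true)
degenerate_accuracy = accuracy(y_true, y_pred_majority)
print(baseline, degenerate_accuracy)
```

On highly imbalanced classes, such a model can look accurate while identifying none of the students in the minority class, which is why accuracy is reported alongside AUROC.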


We show the ROC curves for the models in Figure 3. 
These curves further illustrate the lack of differentiation 
with respect to model performance. For the same pre- 
diction task, the resulting ROC curves across the mod- 
els were nearly identical with little difference in curvature. 
The more notable difference was when comparing the ROC 
curves for predicting graduation with those for predicting 
re-enrollment, as the curves for predicting re-enrollment 
bowed more prominently toward the upper-left corner than 
those for predicting 
graduation. These curvatures, along with the metrics shown 

Figure 3: Receiver operating characteristic curves 
when using different machine learning models. 


in Table 3, demonstrate that predicting students’ eventual 
graduation is a more difficult task than predicting students’ 
re-enrollment. We expected this because the cutoff for the data 
used in the predictions (i.e. students’ first year) was near the 
point at which a student is classified as a re-enrollment (after 
their first year) but was much earlier than when a student 
was classified as a non-completion (after their sixth year). 
This helps highlight the degree to which differing operational 
definitions of attrition can vastly alter the perceived predic- 
tive strength of these classifiers. For other scenarios, alter- 
nate definitions of attrition may be more appropriate and 
the effectiveness of efforts to build predictive models will be 
colored by these definitions and institutional contexts. 
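
For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive example (e.g. a graduate) receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. It is a minimal illustration on made-up scores, not the evaluation code used in this work, and the function name `auroc` is an assumption.

```python
def auroc(y_true, scores):
    """AUROC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2]
example_auroc = auroc(y_true, scores)

# A model that gives every student the same score cannot rank them at all.
constant_auroc = auroc(y_true, [0.5] * len(y_true))
print(example_auroc, constant_auroc)
```

A model that assigns every student the same score yields 0.5 under this definition, which corresponds to the 0.500 baseline AUROC listed in Table 4.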


We show the confusion matrices for the best models for 
predicting graduation and re-enrollment (logistic regression 
and random forests, respectively) in Figure 4. These ma- 
trices show a lower rate of false negatives for the models 
but a higher rate of false positives (i.e. students incor- 
rectly classified by the models as having graduated or re- 
enrolled). To better understand this higher rate of false 
positives, we examined the complete transcript records of 
students who were classified accordingly. Across the false 
positives, we found numerous instances of non-completions 
and non-re-enrollments who had left the University with rel- 
atively strong grades in comparison to their graduating and 


[Figure 4 plots omitted: two confusion matrices, titled 
“Graduation using Logistic Regression” and “Re-enrollment 
using Random Forests”; horizontal axis: Predicted Label: 
Graduated/Re-Enrolled?] 

Figure 4: Confusion matrices when examining the 
top performing algorithms for predicting graduation 
(LR, left) and re-enrollment (RF, right). 


re-enrolling peers. These students also often appeared to be 
pursuing very competitive majors and/or appeared to have 
rigorous post-graduation plans (e.g. pre-medical and pre- 
dental students). Many of these students remained in a pre- 
major state prior to their departure, indicating that though 
they had relatively strong grades, they likely were not able 
to enter into their degree program(s) of choice for various 
reasons and had to leave the University to pursue these 
ambitions as a result. Unfortunately, the University does not 
have a centralized major application database for admissions 
and rejections to specific majors. Having one could shed light 
on these students’ motivations for leaving the University and 
on whether they were, in fact, driven by not 
getting into competitive majors. That said, the fact that 
many of these students were academically similar to their 
graduating and re-enrolling counterparts further illustrates 
why there appears to be an effective ceiling with respect to 
predictive power using the given data, as seen in Table 3. 


From a practical perspective, it should be noted that the 
classification thresholds for these models were not tuned 
with respect to either sensitivity or specificity. In practice, 
when developing institutional systems to identify students 
at risk of leaving, it may be useful to raise the classification 
threshold when predicting whether a student will graduate 
or re-enroll, thus favoring higher precision at the expense of 
lower recall. This would effectively reduce the number 
of students who are predicted to graduate but in actuality 
do not (i.e. false positives) at the expense of more false neg- 
atives, which could be more acceptable when developing an 
alert system for students at risk of dropping out. 
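
The trade-off described above can be sketched on toy data. The example below is illustrative only: the probabilities, the thresholds of 0.5 and 0.7, and the helper `precision_recall` are assumptions and do not come from the models in this work.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall for the positive class (1 = graduated/re-enrolled)."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted probabilities of graduating, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
probs = [0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.45, 0.30]

default_precision, default_recall = precision_recall(y_true, probs, 0.5)
raised_precision, raised_recall = precision_recall(y_true, probs, 0.7)
print(default_precision, default_recall)
print(raised_precision, raised_recall)
```

With the higher threshold, fewer students are predicted to graduate, so false positives fall while false negatives rise. In an alert system, the students who drop below the threshold are the ones flagged for follow-up.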


4.3 Predictions Using Different Data Subsets 


After examining the results from predicting graduates and 
re-enrollments using all features, we used regularized logis- 
tic regression to predict graduation and re-enrollment using 
subsets of the data. We used logistic regression after we saw 
that it performed very well relative to other models for both 
prediction tasks (see Section 4.2) and because it had rela- 
tively fast training times due to having fewer hyperparam- 
eters to tune. This allowed us to more efficiently train the 
12 different models that were needed when examining the 
performance of specific data subsets (i.e. separately mod- 
eling graduation and re-enrollment while using 6 different 

Table 4: Prediction results using specific data 
subsets. Baseline values are based on test set. 


              Graduation           Re-Enrollment 
Subset        Accuracy   AUROC     Accuracy   AUROC 
Baseline      78.0%      0.500     92.9%      0.500 
All           83.2%      0.811     95.0%      0.882 
FY-Sum.       83.0%      0.795     94.9%      0.855 
Department    82.3%      0.788     94.6%      0.847 
Grouped       82.5%      0.781     94.6%      0.845 
Major         79.9%      0.661     94.2%      0.768 
Demo          78.0%      0.634     92.9%      0.643 
Pre-Entry     77.3%      0.630     92.9%      0.616 


data subsets in isolation for each). 


We show the results when using data subsets in Table 4 
alongside the performance of the logistic regression 
classifier from Section 4.2. Transcript-based features tended 
to perform better than information on students prior to 
their enrollment at the University. More specifically, 
demographic data and pre-entry information did relatively 
poorly in predicting both graduation and re-enrollment. In- 
tuitively, this is not a surprise as the admissions process at 
highly-competitive universities tends to be fairly selective 
with an emphasis on supporting and sustaining a success- 
ful yet diverse student body. Additionally, such institutions 
may already have efforts in place to reduce demographic 
disparities for student success. Meanwhile, when looking at 
transcript-based data subsets, first-year summary data per- 
formed the best with performance that was similar to using 
the entirety of the data. This is particularly noteworthy as 
the first-year summary data contained fewer features than 
the other transcript-based data subsets but was centered on 
summaries of performance across time rather than aggrega- 
tions across course departments/numberings. 


These findings are particularly interesting in light of work by 
other researchers. For instance, Nagy and Molontay found 
that attrition could be accurately predicted using what we 
outline as demographic and pre-entry features alone [22]. 
However, we do not see similar success here. We believe this 
could be due to vastly different educational settings and stu- 
dent profiles (e.g. here, most students tend to graduate/re- 
enroll while Nagy’s student population primarily dropped 
out). In earlier work, Dekker et al. found that transcript- 
based features tend to have more predictive strength than 
pre-entry features, but examined this across rather limited 
data subsets [10]. Our results echo this finding. Recently, 
Manrique et al. found that attrition could be predicted us- 
ing student performance in a few key courses [20]. Here, 
we find that aggregates across the first year tend to work 
better than more fine-grained representations of course-taking 
(e.g. grouping classes by course prefix and numbering). As 
discussed in Section 3.3.3, we decided against using individ- 
ual course representations in this work. 


We show the ROC curves for the regularized logistic regres- 
sion models using each of the data subsets as well as the 
entire feature space in Figure 5. The fact that demographic 
and pre-entry data gave generally worse performance than 

Figure 5: Receiver operating characteristic curves 
when using different subsets of data. 


transcript-based features is very much apparent from the 
ROC curves. Data on majors, meanwhile, tended to per- 
form worse than other transcript-based features but better 
than demographic and pre-entry data. The fact that using 
data on majors did not yield particularly strong results likely 
relates to the fact that most students in the dataset were in a 
pre-major state across their first year and formally declared 
their major of interest later in their undergraduate careers. 
As noted above, a centralized major application system was 
not available; had one been, it could have been leveraged in 
addition to data on majors to draw a clearer picture of student 
academic interest. The other transcript-based data subsets, 
meanwhile, had very similar curvatures for the ROC curves when 
predicting both graduation and re-enrollment. 


We show confusion matrices from using the best-performing 
data subset in Figure 6. The best-performing data subset 
for both prediction tasks was first-year summary data. By 
comparing these confusion matrices to those shown in Fig- 
ure 4, it can be seen that using just a limited subset of 
features tends to classify the data similarly to models built 
on the entirety of the data. This is true not only in terms 
of how effective the models are in making predictions, but 
also with respect to the relatively high rate of false positives 
seen across all four matrices. 
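
The false positive and false negative rates discussed here can be read off a confusion matrix whose rows are scaled to sum to one. The sketch below assumes that each row corresponds to an actual label and is normalized by the number of students with that label, which is an assumption about how the figures were drawn. The labels are made up and the function name is hypothetical.

```python
def row_normalized_confusion(y_true, y_pred):
    """2x2 confusion matrix with each row (actual label) scaled to sum to 1.

    Rows are actual labels [0, 1]; columns are predicted labels [0, 1].
    """
    counts = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical labels: 1 = graduated, 0 = non-completion.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

matrix = row_normalized_confusion(y_true, y_pred)
false_positive_rate = matrix[0][1]  # actual 0, predicted 1
false_negative_rate = matrix[1][0]  # actual 1, predicted 0
print(false_positive_rate, false_negative_rate)
```

In this toy case the false positive rate is well above the false negative rate, the same qualitative pattern reported for the matrices in Figures 4 and 6.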


Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019) 16 


[Figure 6 plots omitted: two confusion matrices, titled 
“Graduation using First-Year Summary” and “Re-enrollment 
using First-Year Summary”; vertical axis: Actual Label: 
Graduated/Re-Enrolled?; horizontal axis: Predicted Label: 
Graduated/Re-Enrolled?; color scale: Proportion of Row.] 

Figure 6: Confusion matrices when examining the 
top performing data subset for predicting graduation 
(left) and re-enrollment (right). The top performing 
data subset was the same for both tasks 
(first-year summary data). 


5. FUTURE DIRECTIONS 


We believe the findings regarding the data subsets have 
wide-ranging policy implications, particularly for identifying 
students at risk of dropping out in large, public universities. 
In such settings, there may be longstanding effort to decrease 
demographic disparities with respect to attrition and, as a 
result, transcript records may be more viable as features 
in predictive models than pre-entry/demographic information. 
Furthermore, these settings may also be resource-constrained 
with respect to time available for staff to hand-engineer 
features. In such settings, knowing which features 
would be most predictive of attrition without the need to 
hand-engineer features across the entirety of data available 
to institutions could save time and effort in building models. 
We have had conversations with administrators at the 
University about better interpreting our results and improving 
the processes for identifying students in need of assistance. 


Another direction of interest is better understanding the fea- 
tures used in predicting attrition. This includes not only 
further examining key individual determinants of attrition, 
as we have done in previous work [3, 2], but also finding the 
best combination of features across the subsets. We would 
like to examine this “minimum viable feature space” in the 
context of data available in registrar databases as well as 
investigate the degree to which these features relate to es- 
tablished theory on student attrition [12]. 


6. CONCLUSIONS 


In this work, we use data from the registrar databases of a 
large, public US university to predict both graduation and 
re-enrollment using information limited to students’ first cal- 
endar year at the university. We do this using a dataset of 
students that spans the entirety of the university student 
body and is thus much larger than those used in previous 
studies predicting student attrition (N = 66,060). In so doing, 
we demonstrate that both graduation and re-enrollment can be 
effectively predicted using features generated from data that is 
routinely collected at institutions of higher education. Addi- 
tionally, we also examine the degree to which specific subsets 
of registrar data can be useful in predicting attrition, finding 
that transcript-based features tend to outperform features 


based on student histories prior to college. This implies that 
effective strategies for intervention can be outlined based on 
registrar records. 


Predicting re-enrollment after students’ first year was a 
much more tractable task than predicting graduation. This 
can be attributed to the fact that predicting graduation ne- 
cessitates predicting academic success years into the future 
from the point to which data was limited whereas predicting 
re-enrollment is within a much shorter timeframe. Consider- 
ing the unpredictable influences that cause students to leave 
college prior to graduating (e.g. financial limitations, per- 
sonal hardships, etc.), a more reliable prediction task may 
be to examine whether a student will return on a term-by- 
term basis. This could be particularly useful to develop alert 
systems to identify students at risk of dropout. However, 
this was not explored in this work due to the relatively few 
students who left the University after a single term. 


We found that there appears to be an upper limit for pre- 
dictive power for our dataset. This demonstrates the limi- 
tations when relying solely on registrar data and shows the 
need for additional features on the student experience to 
improve predictive power. Some potential features of inter- 
est include measures of social integration on campus and of 
financial aid. Better understanding student aspirations be- 
yond simply using declared majors could also be of interest, 
especially using alternate representations of student course- 
taking behavior, as shown recently by Luo and Pardos [19]. 


Lastly, we show that features generated from transcript 
records, particularly aggregates and summaries of students’ 
academics, perform better for predictions than demographic 
and pre-entry data. Much of this is likely due to the selec- 
tivity of the University and its admissions policy. Never- 
theless, it demonstrates how useful transcript data can be 
for such prediction tasks in contrast to information on stu- 
dents prior to college. We demonstrate that using subsets 
of data from registrar databases (in this case, aggregates of 
students’ first year) can be nearly as effective for predictions 
as hand-generating a wide swath of features from different 
institutional data sources. 